Dataset

The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Reference: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  1. quality (score between 0 and 10)

Overview of Distributions

We start by looking at histograms for every variable:

We notice that the vast majority of wines were assigned a rating between 5 and 7. There are no wines with ratings below 3 or with a rating of 10:

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
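As a quick check of the table above, the counts can be re-entered by hand and summarized. This is only a minimal sketch in base R: the names `counts` and `quality` are ours, and the numbers are copied from the printed table rather than read from the original data file.

```r
# Rebuild the quality ratings from the counts printed above
# (assumption: we re-enter the table instead of loading the raw data).
counts  <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
             "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# Frequency table of the ratings
quality_table <- table(quality)
print(quality_table)

# Share of wines rated between 5 and 7
share_5_to_7 <- mean(quality >= 5 & quality <= 7)
print(round(share_5_to_7, 3))
```

With these counts, roughly 93% of the wines fall into the ratings 5 to 7.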

Most variables are symmetrically distributed with high peaks. Residual sugar and alcohol are more right-skewed. Most variables seem to have outliers at the upper end of the scale (vol.acid, citric, sugar, chlorides, freeSO2, density). We will mostly be interested in examining relationships between quality and the other variables. To get more information about potential outliers, we compute means and quartiles.

##     fix.acid         vol.acid          citric           sugar       
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides          freeSO2          totalSO2        density      
##  Min.   :0.00900   Min.   :  2.00   Min.   :  9.0   Min.   :0.9871  
##  1st Qu.:0.03600   1st Qu.: 23.00   1st Qu.:108.0   1st Qu.:0.9917  
##  Median :0.04300   Median : 34.00   Median :134.0   Median :0.9937  
##  Mean   :0.04577   Mean   : 35.31   Mean   :138.4   Mean   :0.9940  
##  3rd Qu.:0.05000   3rd Qu.: 46.00   3rd Qu.:167.0   3rd Qu.:0.9961  
##  Max.   :0.34600   Max.   :289.00   Max.   :440.0   Max.   :1.0390  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
##  1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.180   Median :0.4700   Median :10.40   Median :6.000  
##  Mean   :3.188   Mean   :0.4898   Mean   :10.51   Mean   :5.878  
##  3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000

Means and medians are usually very close, supporting our observation of mostly symmetric and peaked distributions. All variables show a narrow interquartile range (IQR). Their maximum values, on the other hand, are quite extreme.

To measure how far the maxima deviate from the bulk of the values, we compute how many multiples of the IQR each maximum lies above the third quartile (Q3):

##  fix.acid  vol.acid    citric     sugar chlorides   freeSO2  totalSO2 
##  6.900000  7.090909 10.583333  6.817073 21.142857 10.565217  4.627119 
##   density        pH sulphates   alcohol   quality 
##  9.795545  2.842105  3.785714  1.473684  3.000000
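The ratio printed above is (max - Q3) / IQR. A minimal sketch of that computation in base R follows; the function names are ours, and we assume R's default quantile definition, which may differ slightly from other software.

```r
# Distance of the maximum from Q3, expressed in multiples of the IQR.
iqr_multiple <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  (max(x, na.rm = TRUE) - q[2]) / (q[2] - q[1])
}

# Same computation from already-known quartiles, e.g. the chlorides
# values printed in the summary above (Q1 = 0.036, Q3 = 0.050, max = 0.346).
iqr_multiple_from_summary <- function(q1, q3, max_value) {
  (max_value - q3) / (q3 - q1)
}

chlorides_multiple <- iqr_multiple_from_summary(0.036, 0.050, 0.346)
print(chlorides_multiple)
```

For a data frame of numeric columns (called `wine` here purely for illustration), `sapply(wine, iqr_multiple)` would produce one ratio per variable.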

The highest chlorides value lies more than 20 times the IQR above the third quartile. All other variables except alcohol also show outliers.

We start with the four variables with the most extreme outliers (chlorides, citric, freeSO2, density). For each of them we produce boxplots with and without outliers. An outlier is defined as a value falling outside the interval [Q1 - 2 x IQR, Q3 + 2 x IQR].
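The outlier rule above, together with the before/after comparison of correlations used in the following paragraphs, can be sketched as below. The function names and the toy data are ours and only illustrate the mechanics; they are not taken from the wine data.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 2 as in the text.
is_outlier <- function(x, k = 2) {
  q   <- quantile(x, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Correlation of x and y with and without the rows where x is an outlier.
cor_with_and_without <- function(x, y, k = 2) {
  keep <- !is_outlier(x, k)
  c(all = cor(x, y), trimmed = cor(x[keep], y[keep]), removed = sum(!keep))
}

# Toy data (made up for illustration, not taken from the wine data)
x <- c(1, 2, 3, 4, 5, 6, 40)
y <- c(7, 6, 6, 5, 4, 3, 6)
result <- cor_with_and_without(x, y)
print(result)
```

In the toy data a single extreme x value hides an otherwise clear negative relationship, which is the kind of effect we check for below.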

There are 172 data points with chloride values more than 2 times the interquartile range above the third quartile, most of them with a quality of 5 or 6. Deleting these outliers would strengthen the correlation between quality and chlorides from -0.2099344 to -0.2767492, which makes deletion tempting. Nevertheless, as we don’t have enough information about how the data were generated, we abstain from deleting that many values.

For citric acid, there are 125 outliers at the upper end of the scale. Deleting them would change the correlation with quality only slightly, from -0.0092091 to 0.0184377: the sign flips, but the value stays close to zero. As the correlation between the two variables is very weak either way, we decide to keep all values.

For free sulfur dioxide, there are 26 outliers at the upper end of the scale. Deleting them would strengthen the correlation from 0.0081581 to 0.0308116. One outlier is particularly extreme. However, as it also coincides with an ‘extreme’ rating, it probably makes sense to keep it.

For density, there are 3 outliers at the upper end of the scale. Deleting them would strengthen the correlation from -0.3071233 to -0.3172324. As with free sulfur dioxide, there is one particularly extreme value. In contrast to that outlier, however, the extreme density value does not coincide with an ‘extreme’ quality rating. Therefore, we decide to discard the three highest values, as they contradict the otherwise quite strong (negative) correlation with quality.

Summary Univariate Analysis

What is the structure of the dataset? Did you create any new variables from existing variables in the dataset? Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data?

The data set contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

We notice that the vast majority of wines were assigned a rating between 5 and 7. There are no wines with ratings below 3 or with a rating of 10.

The chemical attributes show mostly symmetric and peaked distributions. Exceptions are the variables for residual sugar content and alcohol. Except for alcohol, all variables contain quite a few extremely high values. We decided to delete three outliers as they showed a deviation from the observed relationship between quality and density.

What is/are the main feature(s) of interest in your dataset?

We are interested in identifying the chemical properties of the white wines that could have influenced the quality rating. We will try to detect relationships between the rating (variable “quality”) and the variables describing the chemical properties.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

There are a few obvious interdependencies between other variables (e.g. alcohol and density, residual sugar and density). Further, (high) quality is probably not influenced by a single variable but rather by an (optimal?) combination of chemical properties. Thus, it might be interesting to investigate not only bivariate but also multivariate relationships.

Bivariate Plots Section

In order to get an overview of the data, we use ggpairs on a subsample:

The correlation coefficients between each variable and quality, computed on the full dataset, are as follows:

##  [1] "fix.acid"  "vol.acid"  "citric"    "sugar"     "chlorides"
##  [6] "freeSO2"   "totalSO2"  "density"   "pH"        "sulphates"
## [11] "alcohol"   "quality"
##  [1] -0.113815100 -0.195886513 -0.009250724 -0.100114240 -0.210031177
##  [6]  0.008206527 -0.174835130 -0.317232428  0.099423895  0.053710048
## [11]  0.435842693  1.000000000
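A vector like the one above can be obtained by correlating every column with the quality column. The following is a minimal sketch; the helper name and the toy data frame are ours, and the toy values are made up and only mimic the abbreviated column names used in this report.

```r
# Pearson correlation of every numeric column with one target column.
cor_with_target <- function(df, target) {
  sapply(df, function(column) cor(column, df[[target]]))
}

# Toy data frame (made-up values, for illustration only)
toy <- data.frame(
  alcohol = c(9.5, 10.4, 11.4, 12.0, 12.5),
  density = c(0.998, 0.996, 0.994, 0.992, 0.991),
  quality = c(5, 6, 7, 8, 9)
)
toy_cor <- cor_with_target(toy, "quality")
print(round(toy_cor, 3))
```

The last entry is always 1, because the target column is correlated with itself, just as in the output above.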

We observe the strongest correlations with quality for alcohol (0.436) and density (-0.317). The pairwise plots show that alcohol is strongly correlated with density (approx. -0.8) and also with residual sugar (approx. -0.46). Volatile acidity and chlorides give correlation coefficients of about -0.2, and total sulfur dioxide gives -0.175. As we saw earlier, the weakest correlations are found for free sulfur dioxide (0.008) and citric acid (-0.009). The remaining coefficients lie roughly between -0.11 and +0.10.

Let’s have a closer look at the variables alcohol, (residual) sugar and density, where we can expect a clearly visible relation.

## [1] "correlation coefficient"
## [1] -0.8041518

## [1] "correlation coefficient"
## [1] 0.8320888

As the density of alcohol is lower than the density of water, we observe a clearly linear, decreasing relationship between alcohol and density, with a correlation coefficient of -0.804. Sugar, on the other hand, increases the density of water/wine, so that we see a similarly linear relationship with a clear upward trend (correlation coefficient of +0.832). Interestingly, this relationship does not seem to hold for low sugar contents. We have to keep in mind, however, that the influence of sugar on density grows with the sugar content. If the sugar content is close to zero, the influence of other variables (in particular alcohol) on density dominates. We can visualize this effect by zooming in:

We see that on the same (low) sugar level, density varies with alcohol. Again, we can see that density is highest (keeping the amount of sugar fixed) for low alcohol levels.

Next, we investigate the relationship between quality and other variables. Apart from density, which is largely determined by alcohol and sugar (see above), the four variables with the strongest correlation are alcohol, chlorides, volatile acidity and total sulfur dioxide, and we focus on these. We start with alcohol because it showed the strongest linear correlation with quality.

## Source: local data frame [7 x 3]
## 
##   quality mean(alcohol) median(alcohol)
## 1       3      10.34500           10.45
## 2       4      10.15245           10.10
## 3       5       9.80884            9.50
## 4       6      10.57648           10.50
## 5       7      11.36794           11.40
## 6       8      11.63600           12.00
## 7       9      12.18000           12.50

Looking at the means and medians, we can see a roughly linear increase in alcohol for qualities of 5 and higher. The range of alcohol for a given quality rating is quite large and overlaps with the ranges for other quality ratings.
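Per-quality means and medians like those in the table above can be computed with a small helper. The sketch below uses base R's `tapply`; the function name and the toy values are ours, not the wine data.

```r
# Mean and median of a variable for each level of a grouping variable.
summarise_by_group <- function(values, groups) {
  data.frame(
    group  = sort(unique(groups)),
    mean   = as.vector(tapply(values, groups, mean)),
    median = as.vector(tapply(values, groups, median))
  )
}

# Toy data (made up for illustration)
quality <- c(5, 5, 6, 6, 6, 7)
alcohol <- c(9.4, 9.8, 10.2, 10.6, 11.0, 11.4)
alcohol_by_quality <- summarise_by_group(alcohol, quality)
print(alcohol_by_quality)
```

The same helper can be reused for chlorides, total sulfur dioxide and volatile acidity by swapping the first argument.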

## Source: local data frame [7 x 3]
## 
##   quality mean(chlorides) median(chlorides)
## 1       3      0.05430000             0.041
## 2       4      0.05009816             0.046
## 3       5      0.05154633             0.047
## 4       6      0.04519727             0.043
## 5       7      0.03819091             0.037
## 6       8      0.03831429             0.036
## 7       9      0.02740000             0.031

Means for chlorides (boxplots on a log scale) decrease almost monotonically with increasing quality. Medians, again, show a roughly linear relationship for qualities of 5 and higher. The IQRs of chlorides overlap for different quality ratings.

## Source: local data frame [7 x 3]
## 
##   quality mean(totalSO2) median(totalSO2)
## 1       3       170.6000            159.5
## 2       4       125.2791            117.0
## 3       5       150.9046            151.0
## 4       6       137.0014            132.0
## 5       7       125.1148            122.0
## 6       8       126.1657            122.0
## 7       9       116.0000            119.0

The IQRs overlap, as in the previous plots. Once again, we can observe a break in the means and medians of total sulfur dioxide between wine qualities 4 and 5. The same holds for the following plot:

## Source: local data frame [7 x 3]
## 
##   quality mean(vol.acid) median(vol.acid)
## 1       3       0.333250             0.26
## 2       4       0.381227             0.32
## 3       5       0.302011             0.28
## 4       6       0.260180             0.25
## 5       7       0.262767             0.25
## 6       8       0.277400             0.26
## 7       9       0.298000             0.27

For the four variables with the strongest correlation with quality, we could observe different behaviors for qualities above and below 5. That is why we group the quality ratings of 3 to 5 and assign them their median value 5 (mean: 4.8762195).
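The grouping step can be sketched as follows. To keep the example self-contained, we rebuild the ratings from the counts printed in the overview section instead of loading the data; the variable names are ours.

```r
# Ratings rebuilt from the frequency table in the overview section
# (assumption: counts re-entered by hand, not read from the data file).
counts  <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
             "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# Median and mean of the ratings 3 to 5 that are merged into one group
low        <- quality[quality <= 5]
low_median <- median(low)
low_mean   <- mean(low)
print(c(median = low_median, mean = low_mean))

# Grouped rating: everything below 5 is lifted to 5
quality_grouped <- pmax(quality, 5L)
print(table(quality_grouped))
```

Using `pmax` is just one convenient way to recode; an explicit `ifelse(quality < 5, 5, quality)` would give the same result.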

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(alcohol) median(alcohol)
## 1               5       9.84953             9.6
## 2               6      10.57648            10.5
## 3               7      11.36794            11.4
## 4               8      11.63600            12.0
## 5               9      12.18000            12.5

We can see that grouping the lower quality ratings into a single rating has the nice effect that means and medians are now strictly monotonic. The overlap of the IQRs lessens. Nevertheless, the monotonicity only holds on average.

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(chlorides) median(chlorides)
## 1               5      0.05143598             0.047
## 2               6      0.04519727             0.043
## 3               7      0.03819091             0.037
## 4               8      0.03831429             0.036
## 5               9      0.02740000             0.031

For chlorides the same effect can be produced at least for the median values. For total sulfur dioxide…

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(totalSO2) median(totalSO2)
## 1               5       148.5979              149
## 2               6       137.0014              132
## 3               7       125.1148              122
## 4               8       126.1657              122
## 5               9       116.0000              119

and volatile acidity…

## Source: local data frame [5 x 3]
## 
##   quality.grouped mean(vol.acid) median(vol.acid)
## 1               5      0.3102652             0.29
## 2               6      0.2601800             0.25
## 3               7      0.2627670             0.25
## 4               8      0.2774000             0.26
## 5               9      0.2980000             0.27

…it doesn’t work that well, but the trend becomes more visible. For total sulfur dioxide we could think about combining qualities of 8 and 9, but we don’t want to follow this path here.

Summary Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

We can observe some linear relationships between quality and other variables, in particular alcohol, chlorides, total sulfur dioxide and volatile acidity. On average, we can produce strictly monotonic relationships for at least two variables (alcohol and chlorides). In all cases, quality doesn’t separate the levels of a chemical variable into distinct groups. Monotonicity can only be achieved on average.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There is a clear (linear) relationship between density and alcohol and density and sugar. We observed that for low sugar content, the influence of alcohol on the density becomes stronger.

What was the strongest relationship you found?

The strongest (and also most obvious) relationship is the one between residual sugar and density. Alcohol and density are also strongly correlated, even though residual sugar has the stronger influence (as “adding” alcohol can only lower the density to that of alcohol itself).

The strongest relationship between quality and another variable is found for alcohol.

We will look at relationships of more than two variables in the next section.

Multivariate Plots Section

Before we start examining the influence of other variables on the wine quality, we summarize the relationship between alcohol, sugar and density:

From the color coding, we can easily infer that density is highest when the amount of residual sugar is high and alcohol is low (and vice versa). For a fixed sugar level, density varies with the alcohol content. For a given alcohol level, density increases with increasing sugar content. The linear correlation between alcohol and sugar is weaker than the one between alcohol / sugar and density because there are a lot of wines with low sugar content. In fact, half of the wines have a sugar content below 5.2 (black line; we stretched out the low sugar levels using a log scale).

Let’s turn our attention back to the wine quality. In this section we are looking for interactions between the chemical attributes influencing the wine quality. So far we have found notable relationships between the white wine quality and alcohol, chlorides, total sulfur dioxide and volatile acidity. Now, we would like to investigate how other variables (possibly) influence these relationships.

Chlorides represent the amount of salt. We have seen that very high levels of chlorides tend to go hand in hand with lower quality. This might be offset by other variables. Two that come to mind are the amount of residual sugar (adding sweetness) and citric acid (which can add “freshness” and flavor to the wine).

We produce scatterplots of chlorides and sugar (medians in blue) for every quality level. We use the grouped quality assignment:

We cannot identify any clear interactions by adding sugar to our analysis. Let us look at citric acid. We cut off the upper and lower 5% to allow for better visibility:
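Cutting off the upper and lower 5% amounts to keeping only values between the 5% and 95% quantiles. A minimal sketch, with a helper name of our own and a toy vector instead of the wine data:

```r
# Keep only values between the lower and upper quantile (5% and 95% here).
trim_quantiles <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- quantile(x, probs = c(lower, upper), names = FALSE, na.rm = TRUE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

# Toy vector (made up for illustration)
values  <- 1:100
trimmed <- trim_quantiles(values)
print(range(trimmed))
```

To trim rows of a data frame instead of a single vector, the same bounds can be used to build a logical row index.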

We cannot identify any additional interactions by including citric acid in our analysis.

Earlier, we investigated the interactions between alcohol, sugar and density. Here, we want to look at the interactions between alcohol and sugar. During the fermentation process sugar is transformed into alcohol, so high amounts of residual(!) sugar may indicate an early stop of the fermentation process which would lead to a lower alcohol content.

Afterwards, we want to investigate how the two variables influence the wine quality.

Although we obtain a correlation coefficient of -0.4591654 (supporting our assumption), we see a diffuse pattern. Especially for low amounts of residual sugar, there seems to be little or no influence on the alcohol content. Next, we want to investigate interactions between these variables with regard to the wine quality. Again, we use a logarithmic scale for sugar because the distribution is heavily right-skewed (see first section):

We fitted second degree (natural cubic) B-splines to reveal trends more clearly. On average, quality increases with alcohol and decreases with residual sugar content (especially very high sugar content is more often found for wines with quality of 6 or less) - so far so good. However, for sugar content between 3 and 10 (estimate), alcohol increases more strongly with quality than for lower sugar levels. For higher sugar contents, alcohol even seems to decrease on average.

To display the different behavior for high sugar levels, we “cut” the variable into four groups (0,2], (2,6], (6,10] and (10,max] and plot the distribution of alcohol across the wine quality:
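The binning into four sugar groups can be sketched with base R's `cut`. The example values are ours (the extremes 0.6 and 65.8 are the minimum and maximum from the summary table), and we use `Inf` as the upper break instead of the observed maximum, which is an assumption on our part.

```r
# Example sugar values (illustrative; 0.6 and 65.8 are the reported min/max)
sugar <- c(0.6, 2.0, 2.1, 6.0, 9.9, 65.8)

# Right-closed intervals (0,2], (2,6], (6,10] and (10,Inf]
sugar_group <- cut(sugar, breaks = c(0, 2, 6, 10, Inf))
print(table(sugar_group))
```

Because the intervals are closed on the right, a value of exactly 2 or 6 falls into the lower of the two neighboring groups.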

We see that for medium sugar levels between 2 and 10, alcohol levels increase a little more than for lower levels. For sugar levels between 2 and 6, the median alcohol content is already higher than 12 for a quality of 7. Remember that, considering only alcohol and quality, such a high median alcohol level wasn’t observed for qualities less than 8. Here, alcohol increases even further for qualities of 8 and 9 in that range of sugar values. Even more surprisingly, for higher sugar levels the positive relationship (again: see bivariate section) is reversed.

Next, we have a look at volatile acidity again. High levels of acidity are often associated with a vinegary taste.

Very high amounts of volatile acidity are correlated with low quality (as we saw before). Looking at wines with quality 4 (for example), we see that high levels of volatile acidity cannot be offset by adding “freshness” in the form of citric acid. We couldn’t find other variables that would do the job.

Lastly, we examine total sulfur dioxide and citric (similar for other variables):

As the data is too clustered, we cut off the upper and lower 5% quantiles:

For quality 3 and 9 there are not enough data points to be confident in the pattern. Across the other quality levels, we cannot detect any strong interactions.

Let us formulate our findings from this section. Afterwards we give a brief overall summary.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

We focused on the four variables from the bivariate analysis. We investigated their relationship with quality and tried to identify interactions with other variables. Only for alcohol are we confident that we have identified an interaction, namely with sugar. Alcohol and quality show different patterns especially for medium (2-10) and high (>10) sugar levels.

Were there any interesting or surprising interactions between features?

The alcohol distribution for white wines with high sugar levels is quite different from that for lower sugar levels. This wasn’t expected.


Final Plots and Summary

Plot One

Description One

Most white wines obtain a rating between 5 and 7. Only very few ratings of 3 and 4 or 8 and 9 are assigned. There are no ratings less than 3 and no wine is rated 10. One might be more interested in determining how to identify a very good wine instead of a wine of average quality. So, it would have been useful to have more wines of quality 9 (or even 10). For some analyses it can be helpful to combine ratings of 3 to 5 into one group:

Plot Two

Description Two

Medians of the variable chlorides are strictly decreasing with increasing wine quality. The IQRs of chlorides of course still overlap for different quality ratings. Most outliers are found for low and medium ratings. The correlation coefficient between quality and chlorides is -0.2100312; it is slightly stronger (-0.2168668) when the lower ratings are grouped. Leaving density aside, chlorides show the second strongest (linear) relationship with quality (after alcohol).

Plot Three

We fitted second degree (natural cubic) B-splines to reveal trends more clearly. On average, quality increases with alcohol (correlation coefficient: 0.4358427) and decreases with residual sugar content (especially very high sugar content is more often found for wines with quality of 6 or less).

For sugar content between 3 and 10 (estimate), alcohol increases more strongly with quality than for lower sugar levels. For higher sugar contents, alcohol even seems to decrease on average.


Reflection

The dataset contains almost 5000 white wines that were each rated by at least three experts. Eleven chemical attributes like sulfur content, pH level etc. are listed.

In general, there are no striking linear correlations between wine quality and its chemical properties. Our visualizations suggest that at least alcohol and chlorides are noticeably correlated with the quality of white wines. It is helpful to look at the full range of chemical variables, e.g. very high amounts of chlorides or volatile acidity seem to have a negative impact on the quality. Wine quality is centered around medium ratings and ratings of 9 are rare. There are no wines that obtained a rating of 10. Therefore, we focused on overall trends. Among the other variables, strong relationships can be found (and easily explained, e.g. more sugar increases the density).

Only weaker relationships between the quality and the chemical attributes could be found. This is hardly surprising because we cannot expect to perfectly model human taste (a very complex and little understood sense) with only eleven variables.

More information can be extracted by looking at combinations of chemical variables. Here, we found the combination of alcohol and sugar to give further insight. Other combinations did not show interactions.

Our analysis suggests four main points:

For a better understanding of wine quality more chemical properties are needed. As human taste is very complex, it might be useful to include other non-chemical attributes, like wine type, location, hours of sunshine etc.

Also, I think it would be very interesting to include price as a variable. This could give the analysis a whole new perspective.